{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "allied-volume",
   "metadata": {
    "papermill": {
     "duration": 0.015123,
     "end_time": "2021-03-24T11:24:18.422928",
     "exception": false,
     "start_time": "2021-03-24T11:24:18.407805",
     "status": "completed"
    },
    "tags": []
   },
   "source": [
    "This notebook contains an example for teaching.\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "senior-plane",
   "metadata": {
    "papermill": {
     "duration": 0.013777,
     "end_time": "2021-03-24T11:24:18.450894",
     "exception": false,
     "start_time": "2021-03-24T11:24:18.437117",
     "status": "completed"
    },
    "tags": []
   },
   "source": [
    "# Automatic Machine Learning with H2O AutoML using Wage Data from 2015"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "impressive-mistress",
   "metadata": {
    "papermill": {
     "duration": 0.014076,
     "end_time": "2021-03-24T11:24:18.478815",
     "exception": false,
     "start_time": "2021-03-24T11:24:18.464739",
     "status": "completed"
    },
    "tags": []
   },
   "source": [
    "We illustrate how to predict an outcome variable Y in a high-dimensional setting using the AutoML functionality of the *H2O* package, which covers the complete pipeline from the raw dataset to a deployable machine learning model. In the last few years, AutoML (automated machine learning) has become widely popular in the data science community."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "linear-surgeon",
   "metadata": {
    "papermill": {
     "duration": 0.013915,
     "end_time": "2021-03-24T11:24:18.508556",
     "exception": false,
     "start_time": "2021-03-24T11:24:18.494641",
     "status": "completed"
    },
    "tags": []
   },
   "source": [
    "We can use AutoML as a benchmark and compare it to the methods that we used in the previous notebook where we applied one machine learning method after the other."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "olive-dream",
   "metadata": {
    "papermill": {
     "duration": 0.526133,
     "end_time": "2021-03-24T11:24:19.048186",
     "exception": false,
     "start_time": "2021-03-24T11:24:18.522053",
     "status": "completed"
    },
    "tags": []
   },
   "outputs": [],
   "source": [
    "# load the H2O package\n",
    "library(h2o)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "individual-representative",
   "metadata": {
    "papermill": {
     "duration": 0.182687,
     "end_time": "2021-03-24T11:24:19.247049",
     "exception": false,
     "start_time": "2021-03-24T11:24:19.064362",
     "status": "completed"
    },
    "tags": []
   },
   "outputs": [],
   "source": [
    "# load the data set\n",
    "load(\"../input/wage2015-inference/wage2015_subsample_inference.Rdata\")\n",
    "\n",
    "# split the data\n",
    "set.seed(1234)\n",
    "training <- sample(nrow(data), nrow(data)*(3/4), replace=FALSE)\n",
    "\n",
    "train <- data[training,]\n",
    "test <- data[-training,]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "banned-relief",
   "metadata": {
    "papermill": {
     "duration": 5.013121,
     "end_time": "2021-03-24T11:24:24.275376",
     "exception": false,
     "start_time": "2021-03-24T11:24:19.262255",
     "status": "completed"
    },
    "tags": []
   },
   "outputs": [],
   "source": [
    "# start h2o cluster\n",
    "h2o.init()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "boring-cooper",
   "metadata": {
    "papermill": {
     "duration": 7.867453,
     "end_time": "2021-03-24T11:24:32.161116",
     "exception": false,
     "start_time": "2021-03-24T11:24:24.293663",
     "status": "completed"
    },
    "tags": []
   },
   "outputs": [],
   "source": [
    "# convert the data to H2O frames\n",
    "train_h = as.h2o(train)\n",
    "test_h = as.h2o(test)\n",
    "\n",
    "# have a look at the data\n",
    "h2o.describe(train_h)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "invalid-blade",
   "metadata": {
    "papermill": {
     "duration": 40.72251,
     "end_time": "2021-03-24T11:25:12.913500",
     "exception": false,
     "start_time": "2021-03-24T11:24:32.190990",
     "status": "completed"
    },
    "tags": []
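   },
   "outputs": [],
   "source": [
    "# define the outcome and the predictors\n",
    "y = 'lwage'\n",
    "# exclude the outcome itself (lwage) and its level (wage) from the predictors\n",
    "x = setdiff(names(data), c(y, 'wage', 'occ2', 'ind2'))\n",
    "\n",
    "# run AutoML for at most 10 base models and a maximum runtime of 100 seconds\n",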
    "aml = h2o.automl(x=x,y = y,\n",
    "                  training_frame = train_h,\n",
    "                  leaderboard_frame = test_h,\n",
    "                  max_models = 10,\n",
    "                  seed = 1,\n",
    "                  max_runtime_secs = 100\n",
    "                 )\n",
    "# AutoML Leaderboard\n",
    "lb = aml@leaderboard\n",
    "print(lb, n = nrow(lb))"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "entertaining-making",
   "metadata": {
    "papermill": {
     "duration": 0.028316,
     "end_time": "2021-03-24T11:25:12.970888",
     "exception": false,
     "start_time": "2021-03-24T11:25:12.942572",
     "status": "completed"
    },
    "tags": []
   },
   "source": [
    "We see that two Stacked Ensembles are at the top of the leaderboard. Stacked Ensembles often outperform a single model. The out-of-sample (test) MSE of the leading model is given by"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "spread-resident",
   "metadata": {
    "papermill": {
     "duration": 0.29223,
     "end_time": "2021-03-24T11:25:13.291880",
     "exception": false,
     "start_time": "2021-03-24T11:25:12.999650",
     "status": "completed"
    },
    "tags": []
   },
   "outputs": [],
   "source": [
    "aml@leaderboard$mse[1]"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "embedded-tomorrow",
   "metadata": {
    "papermill": {
     "duration": 0.026882,
     "end_time": "2021-03-24T11:25:13.345629",
     "exception": false,
     "start_time": "2021-03-24T11:25:13.318747",
     "status": "completed"
    },
    "tags": []
   },
   "source": [
    "The in-sample performance can be evaluated by"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "compound-sleeping",
   "metadata": {
    "papermill": {
     "duration": 0.060902,
     "end_time": "2021-03-24T11:25:13.435001",
     "exception": false,
     "start_time": "2021-03-24T11:25:13.374099",
     "status": "completed"
    },
    "tags": []
   },
   "outputs": [],
   "source": [
    "aml@leader"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "peaceful-grave",
   "metadata": {
    "papermill": {
     "duration": 0.027663,
     "end_time": "2021-03-24T11:25:13.491063",
     "exception": false,
     "start_time": "2021-03-24T11:25:13.463400",
     "status": "completed"
    },
    "tags": []
   },
   "source": [
    "This is in line with our previous results. To understand how the ensemble works, let's take a peek inside the Stacked Ensemble \"All Models\" model. The \"All Models\" ensemble combines all of the individual models trained in the AutoML run, and it is often the top-performing model on the leaderboard."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "piano-content",
   "metadata": {
    "papermill": {
     "duration": 0.254859,
     "end_time": "2021-03-24T11:25:13.773369",
     "exception": false,
     "start_time": "2021-03-24T11:25:13.518510",
     "status": "completed"
    },
    "tags": []
   },
   "outputs": [],
   "source": [
    "model_ids <- as.data.frame(aml@leaderboard$model_id)[,1]\n",
    "# Get the \"All Models\" Stacked Ensemble model\n",
    "se <- h2o.getModel(grep(\"StackedEnsemble_AllModels\", model_ids, value = TRUE)[1])\n",
    "# Get the Stacked Ensemble metalearner model\n",
    "metalearner <- se@model$metalearner_model\n",
    "h2o.varimp(metalearner)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "amber-meditation",
   "metadata": {
    "papermill": {
     "duration": 0.028741,
     "end_time": "2021-03-24T11:25:13.832588",
     "exception": false,
     "start_time": "2021-03-24T11:25:13.803847",
     "status": "completed"
    },
    "tags": []
   },
   "source": [
    "The table above gives the variable importance of the metalearner in the ensemble. The inputs to the metalearner are the predictions of the base models, so each \"variable\" is a base model. AutoML Stacked Ensembles use the default metalearner algorithm (a GLM with non-negative weights), so the variable importance of the metalearner is the standardized coefficient magnitude of each base model in that GLM."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "reserved-trademark",
   "metadata": {
    "papermill": {
     "duration": 0.421132,
     "end_time": "2021-03-24T11:25:14.283327",
     "exception": false,
     "start_time": "2021-03-24T11:25:13.862195",
     "status": "completed"
    },
    "tags": []
   },
   "outputs": [],
   "source": [
    "h2o.varimp_plot(metalearner)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "adjustable-repository",
   "metadata": {
    "papermill": {
     "duration": 0.030956,
     "end_time": "2021-03-24T11:25:14.345344",
     "exception": false,
     "start_time": "2021-03-24T11:25:14.314388",
     "status": "completed"
    },
    "tags": []
   },
   "source": [
    "## Generating Predictions Using the Leader Model\n",
    "\n",
    "We can also generate predictions on the test sample using the leader model object."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "answering-uganda",
   "metadata": {
    "papermill": {
     "duration": 1.175942,
     "end_time": "2021-03-24T11:25:15.551977",
     "exception": false,
     "start_time": "2021-03-24T11:25:14.376035",
     "status": "completed"
    },
    "tags": []
   },
   "outputs": [],
   "source": [
    "pred <- as.matrix(h2o.predict(aml@leader,test_h)) # make prediction using x data from the test sample\n",
    "head(pred)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "conceptual-victor",
   "metadata": {
    "papermill": {
     "duration": 0.033008,
     "end_time": "2021-03-24T11:25:15.618587",
     "exception": false,
     "start_time": "2021-03-24T11:25:15.585579",
     "status": "completed"
    },
    "tags": []
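   },
   "source": [
    "This allows us to estimate the out-of-sample (test) MSE together with its standard error: regressing the squared prediction errors on a constant returns their mean (the MSE) and the standard error of that mean."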
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "steady-blend",
   "metadata": {
    "papermill": {
     "duration": 0.137339,
     "end_time": "2021-03-24T11:25:15.788713",
     "exception": false,
     "start_time": "2021-03-24T11:25:15.651374",
     "status": "completed"
    },
    "tags": []
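   },
   "outputs": [],
   "source": [
    "# outcome in the test sample\n",
    "y_test <- as.matrix(test_h$lwage)\n",
    "# intercept-only regression of the squared errors: estimate = MSE, std. error = its standard error\n",
    "summary(lm((y_test-pred)^2~1))$coef[1:2]"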
   ]
  },
  {
   "cell_type": "markdown",
   "id": "configured-spray",
   "metadata": {
    "papermill": {
     "duration": 0.034945,
     "end_time": "2021-03-24T11:25:15.858930",
     "exception": false,
     "start_time": "2021-03-24T11:25:15.823985",
     "status": "completed"
    },
    "tags": []
   },
   "source": [
    "We observe both a lower MSE and a lower standard error compared to our previous results (see [here](https://www.kaggle.com/janniskueck/pm3-notebook-newdata))."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "sufficient-adjustment",
   "metadata": {
    "papermill": {
     "duration": 0.074909,
     "end_time": "2021-03-24T11:25:15.967065",
     "exception": false,
     "start_time": "2021-03-24T11:25:15.892156",
     "status": "completed"
    },
    "tags": []
   },
   "outputs": [],
   "source": [
    "h2o.shutdown(prompt = FALSE)"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "R",
   "language": "R",
   "name": "ir"
  },
  "language_info": {
   "codemirror_mode": "r",
   "file_extension": ".r",
   "mimetype": "text/x-r-source",
   "name": "R",
   "pygments_lexer": "r",
   "version": "3.6.3"
  },
  "papermill": {
   "default_parameters": {},
   "duration": 62.011749,
   "end_time": "2021-03-24T11:25:17.053885",
   "environment_variables": {},
   "exception": null,
   "input_path": "__notebook__.ipynb",
   "output_path": "__notebook__.ipynb",
   "parameters": {},
   "start_time": "2021-03-24T11:24:15.042136",
   "version": "2.3.2"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}
